6. FUTURE WORK

My database still needs some taxonomic consolidation.
For example, I still have to assess the type species
of many of the genera as well as the names
involved in many nomina nova.

An immediate target is to finish entering the classification
of Wolters, but the ultimate goal of my taxonomic
referential database is to contain all names of
birds that have been published until today. Therefore
the next logical step would be to computerize the Catalogue
of Birds in the British Museum (Sharpe 1874-1898).
Given that the quality of the taxonomic
information varies from author to author across the 27
volumes of this enormous work, any added name
needs careful checking of its reference. It is difficult
to estimate the time outlay for this work.

A different and/or additional way to increase the completeness
of my database is to continue entering type
data of different museums world-wide. A first trial
with the digitized type data of the NHM was a great
success: more than 80% of all names could be
matched with taxa already entered into the database.

Obviously, further co-operation with additional museums
possessing types is welcome. In return each
museum would be allowed to access my database,
which will be housed on the MNHN web site.
Abstract. Natural history museum collections provide the basic documentation of life on Earth. As such, they represent the
critical and unique resource by which that life may be understood, and have immense economic and scientific importance.
Nevertheless, particularly in recent decades, natural history museums have received less and less attention - and fewer resources - in
spite of their importance. A series of new efforts, however, aims to recoup that prominence via community efforts to unite
data resources towards a vastly improved understanding of biodiversity and its implications. The Species Analyst represents
an effort to unite natural history collections databases worldwide to this end: 77 institutions now cooperate or are committed
to cooperate in serving records of 51 million natural history museum specimens to users worldwide, and the facility has seen more
than 700,000 users to date.
1. INTRODUCTION

Computerization of ornithological collections is increasingly
considered a priority for curators and staff
of natural history museums. A common quandary,
however, is how and why to get started. The curator is
presented with a bewildering variety of databasing
programs, some especially designed for specimen
records, and others off-the-shelf generic database programs
that can be customized for any use. Choice of
a platform, choice of data fields, and choice of computerization
strategy all become critical - and difficult
- considerations. Unfortunately, these considerations
can often seem so complex that computerization
efforts are never initiated.

Moreover, presented with a thousand and one other
priorities of collections building, specimen conservation,
institutional politics, and research efforts, and
given the significant time investment that computerization
requires, the question arises as to whether the
result is worth the time. That is, one must consider
what the benefits of computerization are, and how
much they benefit the collection, the curator, and
the broader community.

The purpose of this contribution is to provide a rationale
for computerizing bird collections as a critical step
forward in their care. Along the way, we review the steps
involved - a sort of minimum-standard guide to starting
computerization efforts. Finally, we provide a
series of examples of how computerizing collections
data, and sharing those data across many institutions
worldwide, benefits the collections themselves.


2. WHY COMPUTERIZE A COLLECTION? 

Databasing or computerizing a collection is a lot of 
work, and may easily absorb years of effort. So why 
do it? Several reasons argue strongly for taking this 
step. A partial list follows: 

- Get to know your collection - a sweep through the
  whole collection, drawer by drawer, gives a
  unique knowledge of a particular collection.

- Discover important specimens - many fascinating
  discoveries have resulted from the specimen-by-specimen
  attention during computerization efforts,
  including species new to science, lost type specimens,
  important historical specimens, etc.

- Detect problems - again, the specimen-by-specimen
  attention can help to detect serious problems that
  might otherwise not be noticed: damage from
  insects or water, fading of plumages, drying of
  spirit specimens, etc.

- New views of the collection - although we are familiar
  with summaries of collections in terms of taxonomic
  completeness, and perhaps regional summaries,
  many new views of collections open when
  a collection is computerized, e.g., maps of the geographic
  distribution of specimens, summaries of
  accessions over time, etc.

- Save curatorial time - making summaries of holdings,
  preparing loan invoices, tracking down particular
  specimens, and many other curatorial tasks
  are considerably more efficient when the collection
  is available in database form.

- Standardize taxonomy - once data are in electronic
  form, comparing names against a standard list (e.g.,
  the Peters’ check-list) can identify a first set of non-standard
  names that require checking and updating.

- Efficient information access - many questions and
  data requests that require hours or days of work for
  an uncomputerized collection will suddenly become
  feasible to answer in minutes, making possible
  much more creative uses of the information in
  collections. For example,

  - What are your holdings of taxon X?
  - What are your holdings from country X?
  - Do you have specimens collected by person X?
  - What is the history of specimen acquisition rates
    in your collection?
  - And many more ...

In short, computerization of a collection is a major 
undertaking, but ends up repaying the investment of 
time and effort many times over. 
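
To make the example questions above concrete, the following is a minimal sketch of how they reduce to one-line queries once the catalogue is in a database. It assumes a single flat table in SQLite (via Python's standard sqlite3 module); the table name, the column names, the `year_accessioned` field, and all sample rows other than the first are hypothetical and serve only as illustration.

```python
import sqlite3

# Hypothetical flat catalogue table; column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE specimens (
        catalogue_number INTEGER PRIMARY KEY,
        genus TEXT, species TEXT,
        country TEXT, collector TEXT,
        year_accessioned INTEGER
    )
""")
# Made-up sample rows for demonstration.
rows = [
    (15230, "Cyanocitta", "cristata", "USA", "Fredrick E. Jones", 1956),
    (15231, "Cyanocitta", "stelleri", "Mexico", "Fredrick E. Jones", 1957),
    (15232, "Atlapetes", "pileatus", "Mexico", "A. Collector", 1957),
]
conn.executemany("INSERT INTO specimens VALUES (?, ?, ?, ?, ?, ?)", rows)

# Column names cannot be bound as SQL parameters, so restrict them to a known set.
ALLOWED_COLUMNS = {"genus", "country", "collector"}


def holdings(conn, column, value):
    """Return the catalogue numbers whose `column` equals `value`."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unsupported column: {column}")
    cur = conn.execute(
        f"SELECT catalogue_number FROM specimens "
        f"WHERE {column} = ? ORDER BY catalogue_number",
        (value,),
    )
    return [row[0] for row in cur]


def acquisitions_per_year(conn):
    """Return (year, number of specimens accessioned) pairs, oldest first."""
    cur = conn.execute(
        "SELECT year_accessioned, COUNT(*) FROM specimens "
        "GROUP BY year_accessioned ORDER BY year_accessioned"
    )
    return cur.fetchall()


print(holdings(conn, "genus", "Cyanocitta"))   # holdings of taxon X
print(holdings(conn, "country", "Mexico"))     # holdings from country X
print(acquisitions_per_year(conn))             # acquisition history
```

The same questions could equally be asked through the query builder of a desktop database program; the point is only that each becomes a single query rather than a search through drawers or ledgers.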

3. CHOOSING A PLATFORM 

The first big question to be answered is
which platform (databasing program) to use. This
decision can become complex: sometimes, museum
administrators decide to force all collections in the
museum to use the same program. Even if one has
the freedom to choose, should one choose among
the many programs that have been developed
specifically for natural history museum specimens
(BIOTA, BIOTICA, SPECIFY, etc.), or a generic
program off the shelf (e.g., Microsoft Access, Oracle)?
Each option has its
strengths and weaknesses (Table 1). In general, we
would recommend the off-the-shelf option for
small, old, or inactive collections, and the specimen
databasing programs for larger, data-rich, and
very active collections.

Regardless of this choice, one should insist on several
minimum criteria for a databasing platform.
These criteria are critical features that a program
must fulfil in order to avoid problems later,
as follows:

- Capacity for export to other, generic formats,
  particularly ASCII delimited format, to allow
  reporting, export to other programs, and porting
  to future technologies and platforms.

- Compatibility with Structured Query Language
  (SQL), which permits many functionalities related
  to sharing data to be added to your database.
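
As an illustration of the first criterion, the sketch below dumps a table to comma-delimited ASCII text using only Python's standard csv and sqlite3 modules. The function name `export_delimited`, the table layout, and the sample row are assumptions made for the example; any platform that can produce an equivalent file meets the criterion.

```python
import csv
import io
import sqlite3


def export_delimited(conn, table, out, delimiter=","):
    """Write every row of `table` to the text stream `out`, header row first."""
    # Table names cannot be bound as SQL parameters, so accept simple identifiers only.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    cur = conn.execute(f"SELECT * FROM {table}")
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])  # column names
    writer.writerows(cur)                                 # data rows


# Hypothetical one-row catalogue for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE specimens (catalogue_number INTEGER, genus TEXT, locality TEXT)"
)
conn.execute("INSERT INTO specimens VALUES (15230, 'Cyanocitta', 'Oxford, 10 km E')")

buffer = io.StringIO()
export_delimited(conn, "specimens", buffer)
print(buffer.getvalue())
```

Fields that themselves contain the delimiter (such as the locality above) are quoted by the csv writer, which is what keeps the file readable by other programs.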

Once a platform has been identified that fits the
particular needs of a collection, and meets these
basic requirements, then design of the computerization
effort can begin.

If the reasoning outlined above suggests that the 
best solution to computerization is that of a more 
complex program specifically designed for natural 
history specimen data, then you should read about 
several of the programs that are available. Links to 
a number of such programs are presented in Table 2. 


Table 1: Summary of advantages and disadvantages of specialized versus generic programs as platforms for computerizing
bird collections.


NATURAL HISTORY MUSEUM SPECIMEN DATABASING PROGRAMS

Advantages:
- Designed specifically for specimen management
- Features such as authority lists, loan invoice reporting, etc.
- No customization or little customization required
- Most complex solutions specific to natural history specimens are tractable

Disadvantages:
- Can disappear: long-term support often depends on a person - researcher or developer - who can decide not to
  support the program further, or who may decide not to update to newer versions (e.g., MUSE)
- Expert advice may be unavailable in a particular city
- May not permit very simple solutions to simple problems
- Steeper learning curve

OFF-THE-SHELF GENERIC DATABASING PROGRAMS

Advantages:
- Long-term continuity of support from the company
- Easy availability of expert advice, given broad usage in many communities
- Simplest solutions are feasible
- Simple learning curve

Disadvantages:
- May need customization of program for intermediate-to-complex situations
- Not designed specifically for specimen data
- Complex features (e.g., reporting, authority lists) not automatically available

Table 2: Selected specialized programs designed specifically
for collections data. Provided are World Wide Web
links for more information.


Program      URL

SPECIFY      http://usobi.org/specify/
Biotica      http://www.conabio.gob.mx/biotica_ingles/distribucion_b.html
BioLink      http://www.biolink.csiro.au/
BIOTA        http://viceroy.eeb.uconn.edu/biota
KE EMu       http://www.kesoftware.com/
             (not recommended for integration via Species Analyst)


4. CHOOSING DATA FIELDS 

This step may prove to be the most critical of all in the
process of computerization. With too many fields,
time and file space are wasted, whereas with too few,
fields will have to be added later or one will have to
live without them. If an incorrect structure is chosen,
the database may be forever handicapped by this
design flaw. However, the challenge is reduced quite
a bit with an understanding of a few basic ideas. Specimen
data, in their simplest form, distill down to three
linked sets of information about each specimen:

- Taxonomic information - the taxonomic identity of
  the specimen

- Geographic information - the geographic location
  of its collection

- Detailed documentation of the specimen - time of
  collection, collector identity, museum catalogue
  number, sex, age, body mass, etc.

Thinking in this manner, we can envision a structure
for a specimen database that would capture this information
optimally. Taxonomy and geography are both
hierarchical concepts, and so we can represent them
as such, which would make for three interacting sets
of information (Fig. 1).


In the simplest sense, then, one could use a straightforward
single table that holds the critical fields (see Table 3),
either in a spreadsheet program such as Microsoft Excel
or (better still) in a database program such as Microsoft
Access. This very simple
structure provides a clear, workable solution for
small collections. In a more complex situation, in
which more specimens are to be computerized, this
structure can be made relational (Fig. 1), that is, made
up of several tables that interconnect. The advantage
of a relational database structure is that elements of
the database are entered only once: e.g., the locality
descriptor for the 150 specimens collected at USA/
Kansas/Douglas Co./Lawrence/10 km E is entered
only once, reducing the possibility of typographical
errors.


Table 3: Critical minimum set of fields for a simple collections
database.


Field                              Example

Catalogue number                   15230
Genus                              Cyanocitta
Species                            cristata
Subspecies                         cristata
Date of collection                 24 October 1956
Collector                          Fredrick E. Jones
Sex                                Female
Age                                Adult
Body mass                          120 g
Country                            USA
State or province                  Ohio
County                             Butler Co.
Named place and directions from    Oxford, 10 km E


This sort of simple relational structure can be implemented
in a program such as Microsoft Access with a
few hours’ attention by a technician familiar with the
program. The custom specimen database programs
use a more complex relational structure,
but one that is in essence based
on this overall backbone. Again, the
more complex the demands that one
will wish to place on the database
(e.g., more complex queries, more
detailed reporting, more specimens),
the more complex the database structure
that will be required. For relatively
simple applications, however,
the simple flat file (single table)
setup described above will often be
adequate.


[Figure 1: three linked tables]

Taxonomy:  Key for taxon, Subspecies, Species, Genus, Family, Order

Detail:    Key for taxon, Key for location, Unique catalogue number,
           Date of collection, Collector, Sex, Age, Body mass

Geography: Key for location, Named place and directions from, County,
           State or province, Country, Region

(The Detail table links to Taxonomy via the key for taxon and to
Geography via the key for location.)


Fig. 1: Diagrammatic illustration of a simple relational database structure
designed to link hierarchically organized geographic and taxonomic information
with specific data regarding a particular specimen.
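
The three-table structure of Fig. 1 can be sketched in a few lines of SQL. The version below is a minimal illustration using SQLite through Python's standard sqlite3 module, not a recommended production schema: the table and column names are our own renderings of the field labels in Fig. 1 (`bird_order` is used because ORDER is a reserved word in SQL), the "North America" region value is an assumption, and the second specimen is invented to show that a locality is stored only once however many specimens refer to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the links between tables
conn.executescript("""
    CREATE TABLE taxonomy (
        taxon_key INTEGER PRIMARY KEY,
        bird_order TEXT, family TEXT, genus TEXT, species TEXT, subspecies TEXT
    );
    CREATE TABLE geography (
        location_key INTEGER PRIMARY KEY,
        region TEXT, country TEXT, state_province TEXT, county TEXT,
        named_place TEXT
    );
    CREATE TABLE detail (
        catalogue_number INTEGER PRIMARY KEY,
        taxon_key INTEGER NOT NULL REFERENCES taxonomy(taxon_key),
        location_key INTEGER NOT NULL REFERENCES geography(location_key),
        date_collected TEXT, collector TEXT, sex TEXT, age TEXT, body_mass_g REAL
    );
""")

# Each taxon and each locality is entered exactly once ...
conn.execute(
    "INSERT INTO taxonomy VALUES "
    "(1, 'Passeriformes', 'Corvidae', 'Cyanocitta', 'cristata', 'cristata')"
)
conn.execute(
    "INSERT INTO geography VALUES "
    "(1, 'North America', 'USA', 'Ohio', 'Butler Co.', 'Oxford, 10 km E')"
)
# ... and every specimen simply points at them by key.
# (The second specimen is invented for illustration.)
conn.executemany(
    "INSERT INTO detail VALUES (?, 1, 1, ?, ?, ?, ?, ?)",
    [
        (15230, "1956-10-24", "Fredrick E. Jones", "Female", "Adult", 120.0),
        (15231, "1956-10-24", "Fredrick E. Jones", "Male", "Adult", None),
    ],
)

# A report joins the three tables back into one row per specimen.
report = conn.execute("""
    SELECT d.catalogue_number, t.genus || ' ' || t.species, g.country, g.named_place
    FROM detail AS d
    JOIN taxonomy AS t ON t.taxon_key = d.taxon_key
    JOIN geography AS g ON g.location_key = d.location_key
    ORDER BY d.catalogue_number
""").fetchall()
print(report)
```

Correcting a misspelled locality then means editing one row in the geography table rather than every specimen record from that site.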
5. COMPUTERIZATION STRATEGY

The next question to be faced is the strategy for computerization.
This decision depends heavily on the
exact situation of a collection. If, on the one hand, an
excellent paper catalogue or card file exists, one may
wish to computerize directly from that, and then verify
the accuracy and completeness later from the
actual specimens. If, on the other hand, a good card
file or catalogue does not exist, or if many specimens
may have left the collection (exchanged or deaccessioned)
or were never entered in the catalogue, then you may be better
off computerizing directly from specimens.

In general, two passes through the collection will be
necessary as part of any computerization effort. The
first will simply get each specimen’s data into the
computer as efficiently as possible. The second will
verify (1) the existence of the specimen, (2) that all
data elements are entered in the database, and (3) that
all of the specimen’s data are correct as entered. This
verification step, although labor-intensive, is critical
to making the database a correct representation of the
information contained in the specimens’ labels.

All computerization efforts should involve the critical
step of backing up data at regular intervals. Too many
‘impossible accidents’ have removed a year of work,
and set a computerization effort back terribly. Backing
up data should be done as permanently as possible;
that is, compact disks are better than floppy disks.
It should also be done with redundancy: each time
that you make a copy, if at all possible, it should not
over-write the previous copy. This preservation of
‘versions’ of the database allows one to go back a
week or a month if some error appears in the data set.
Finally, given the possibility of more catastrophic
losses, the back-up copies should be stored off-site,
preferably in several places. Excellent storage sites
for these copies can include libraries or archives, or
curators’ homes, or they can even be transferred via
the Internet or via mail to another country.
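
The 'versions' idea can be automated very simply. The sketch below, using only the Python standard library, copies a database file to a backup directory under a date-stamped name and refuses to over-write an existing copy. The function name `backup_database`, the file name `birds.db`, and the naming scheme are assumptions for illustration; off-site storage of the resulting files still has to be arranged separately.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def backup_database(db_path, backup_dir, now=None):
    """Copy db_path into backup_dir under a timestamped name; never over-write."""
    db_path = Path(db_path)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{db_path.stem}-{stamp}{db_path.suffix}"
    if target.exists():
        raise FileExistsError(f"refusing to over-write {target}")
    shutil.copy2(db_path, target)  # copy2 also preserves file timestamps
    return target


# Demonstration in a temporary directory with a stand-in "database" file.
with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "birds.db"
    backups = Path(tmp) / "backups"

    db.write_text("catalogue data, version 1")
    first = backup_database(db, backups, datetime(2002, 1, 7, 9, 0, 0))

    db.write_text("catalogue data, version 2")
    backup_database(db, backups, datetime(2002, 1, 14, 9, 0, 0))

    versions = sorted(p.name for p in backups.iterdir())
    first_text = first.read_text()  # the earlier version is still recoverable

print(versions)
print(first_text)
```

Because every copy has its own name, an error discovered this week can be traced back through last week's or last month's copy.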

6. THE SPECIES ANALYST (TSA) 

The Species Analyst (http://speciesanalyst.net/) is a collection
of software tools that permits integration of
computerized collections data among institutions
around the world into a distributed biodiversity information
facility. For example, a user might wish to ask
for records of any taxon from Yellowstone National
Park or from Burma, or all specimens collected by
Alexander von Humboldt, and retrieve information in a
matter of seconds from 50 institutions around the world.

TSA uses a hybrid of Z39.50 (an information transfer
protocol developed about 20 years ago in the bibliographic
community) and XML (a more modern and
flexible data format) to permit efficient query and
retrieval of data. TSA may be accessed via a web portal
that permits basic queries, or via extensions to
Microsoft Excel (for retrieval of data in spreadsheet
format) and ArcView (for retrieval of data as GIS
coverages) (downloads available at
http://speciesanalyst.net/downloads).

TSA currently integrates data sets from 22 institutions,
for a total of 15 million specimen data records
for over 50,000 species; a total of 58 institutions have
formally committed to participation, which will take
the total number of specimen records served to about
50 million. A special strength at present is in ichthyological
data, as FishNet (http://speciesanalyst.net/fishnet/)
has taken excellent advantage of TSA technology
to create a data facility linking most important
computerized fish collections. Now funded is a parallel
network for mammal collections data (MANIS,
based at the Museum of Vertebrate Zoology;
http://elib.cs.berkeley.edu/manis/), and networks for herpetological
and ornithological (expanded) specimen
data are pending and in preparation, respectively.

7. WHY SHARE DATA ONCE COMPUTERIZED?

Above, we listed the first set of benefits of computerization
of bird collections - namely, freer and more
complete access to the information content of the
specimens that make up the collection. These benefits
are indeed considerable, and add enormously to a
curator’s ability to take care of a collection. However,
once data are computerized, if they are shared, and
integrated with data from other collections around the
world, an additional set of benefits accrues.

In essence, a set of emergent properties comes into
being once all (or nearly all) data are integrated for a
particular taxon or region. We have come to appreciate
these emergent properties as we have assembled
the Atlas of Mexican Bird Distributions (Navarro &
Peterson, in prep.), a centralized database now
including the contents of more than 60 natural history
museum collections of Mexican birds. This 11-year
project has resulted in a diversity of synthetic publications
regarding the Mexican avifauna (Navarro-Siguenza
et al. 1992a, b; Peterson 1993; Peterson
et al. 1993; Peterson 1998; Peterson et al. 1998a, b;
Navarro-Siguenza & Peterson 1999, 2000; Peterson
et al. 2000, 2001, 2002). Herein, we will use this
exemplar data set to demonstrate a variety of potential
benefits of broad integration of data across institutions,
as follows:

7.1 Georeferencing as a Community 

Georeferencing locality data for specimens opens
doors to a multitude of new capabilities and new functionalities
for collections data. Indeed, all of the
advances of geographic information systems (GIS)
open up to collections data once latitude and longitude
data are available for the collecting localities for
each specimen. Nevertheless, georeferencing collections
data - even once they are in electronic form -
represents an enormous task.

Fig. 2: Map of Mexico with collecting localities plotted by numbers of specimens collected at each point (graded symbol
size: smallest = 1 specimen, largest = >100 specimens). For five points, to illustrate the redundancy of collecting localities
among museums, we provide pie diagrams that illustrate the relative holdings of specimens from that particular site among
scientific collections (see Acknowledgements for institutions and abbreviations).

Integrating this task over many institutions, however,
takes advantage not just of having more people to
help in a large task, but also of the redundant nature
of the geographic sampling of birds (Fig. 2). Indeed,
more than 25% of Mexican bird collecting localities
occur in more than one museum, and some in more
than 20 museums. This redundancy results from collections
being dispersed among numerous museums
(e.g., the specimens of Wilmot W. Brown from
Chilpancingo, Guerrero), and from certain sites being
especially accessible or well-known as collecting
localities in particular regions (e.g., Cerro San Felipe,
Oaxaca).

A first experiment in cooperative georeferencing is
beginning in the mammal community in the United
States. The MANIS network, a U.S. National Science
Foundation-funded effort, is connecting 17 institutions
with computerized holdings of mammal specimens.
A first step in MANIS’ integration efforts is the
pooling of institutional lists of localities to be georeferenced;
institutions are then ‘signing up’ for particular
regions, perhaps a home state, or an area of particular
interest to the curator. In this way, efforts in georeferencing
have a direct return for a particular investigator
or institution, and add to the community pool
of georeferenced information.
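
The pooling step described above can be sketched as follows. This is only an illustration of the idea, not the procedure used by MANIS: the museum names are placeholders, the locality lists are invented, and the `normalize` function is a deliberately crude matching rule (real locality strings need far more careful reconciliation).

```python
from collections import defaultdict


def normalize(locality):
    """Crude matching key for a locality string: lowercase, collapse whitespace."""
    return " ".join(locality.lower().split())


def pool_localities(lists_by_museum):
    """Map each normalized locality to the sorted list of museums that hold it."""
    pooled = defaultdict(set)
    for museum, localities in lists_by_museum.items():
        for locality in localities:
            pooled[normalize(locality)].add(museum)
    return {loc: sorted(museums) for loc, museums in pooled.items()}


def shared_fraction(pooled):
    """Fraction of distinct localities that occur in more than one museum."""
    shared = sum(1 for museums in pooled.values() if len(museums) > 1)
    return shared / len(pooled)


# Invented locality lists from three placeholder institutions.
lists_by_museum = {
    "Museum A": ["Chilpancingo, Guerrero", "Cerro San Felipe, Oaxaca"],
    "Museum B": ["chilpancingo,  Guerrero", "Oxford, 10 km E"],
    "Museum C": ["Cerro San Felipe, Oaxaca"],
}

pooled = pool_localities(lists_by_museum)
for locality, museums in sorted(pooled.items()):
    print(locality, "->", museums)
print(f"{shared_fraction(pooled):.0%} of localities are shared")
```

Each distinct locality in the pooled list needs to be georeferenced only once, after which the coordinates can be returned to every institution that holds specimens from that site.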

7.2 Detecting Errors in Date and Locality 

Once specimen data are integrated, and have been georeferenced,
further data refinements are possible. A
common question is that of the relative reliability of
the data associated with specimens from different collectors
(Binford 1989). Because of the fragmented
and dispersed nature of collectors’ material, an answer has
always been out of reach before. For instance, the still-living
collector and ornithologist Robert W. Dickerman
has deposited specimens at 14 of the 32 museums
included in our present summary; the early twentieth
century collector Wilmot W. Brown has specimens
distributed across 23 of the 32 museums. Once these
data are pooled, however, new insights become possible
regarding collectors’ relative reliability.



Fig. 3: Maps of collecting localities for two contrasting
groups of collectors in Mexico: a Museo de Zoologia,
UNAM (MZFC) expedition in Spring 1991, and the collections
of Mario del Toro Aviles in June 1949. Organized by
collection date, consistencies and inconsistencies of specimen
labeling become clear.


Basically, by assembling the entire opus of a collector,
and sorting specimen locality by collecting date, it is
possible to assess how geographically reasonable the
combination of dates and localities is. Hence, to present
a contrasting pair of examples, a Museo de
Zoologia, UNAM, expedition in 1991 scouted numerous
sites in central and eastern Oaxaca (Fig. 3, top);
although its route was complex, specimens from particular
localities were clumped in time, and a sensible
route could be reconstructed (although, in constructing
this example, we detected an error in our georeferencing:
the ‘Benito Juarez’ referred to in the locality
descriptor was the one in eastern Oaxaca, not the one
in central Oaxaca). In stark contrast, specimens scattered
across four museums (MLZ, LACM, FMNH,
USNM) suggest that the infamous collector Mario del
Toro Aviles worked at several sites across Mexico in
June 1949; plotting these localities by date, however,
reveals a number of points at which impossibly long
journeys would have had to be made in too
short a time (Fig. 3, bottom). This result confirms earlier
suspicions that del Toro Aviles’ dates and localities
are to be regarded with utmost caution (Binford
1989; Peterson & Nieto-Montes de Oca 1996).

This approach can be used to detect problems in collectors’
series, which will be errors either in date of
collection or in collecting locality. Indeed, for an integrated,
distributed data set consisting of the holdings
of many institutions, it could be implemented as an
error-seeking module that scans the data set collector
by collector, and flags particular records as potential
problems. These flagged specimen lists could then be
distributed to collection curators for checking.
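
One possible form for such an error-seeking module is sketched below. It sorts a single collector's georeferenced records by date and flags consecutive pairs that would require travel faster than a chosen threshold. The threshold of 150 km per day, the record layout, and the sample records (with made-up identifiers, dates, and approximate coordinates) are all assumptions for illustration; a real module would need thresholds suited to the era and terrain, and a flag only indicates that one of the two records deserves checking.

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def flag_implausible(records, max_km_per_day=150.0):
    """Flag consecutive records of one collector that imply impossibly fast travel.

    records: list of (catalogue_id, collection_date, latitude, longitude).
    Returns a list of (earlier_id, later_id) pairs exceeding the threshold.
    """
    ordered = sorted(records, key=lambda r: r[1])  # sort by collection date
    flagged = []
    for prev, curr in zip(ordered, ordered[1:]):
        # Records from the same day are treated as one day apart.
        days = max((curr[1] - prev[1]).days, 1)
        distance = haversine_km(prev[2], prev[3], curr[2], curr[3])
        if distance / days > max_km_per_day:
            flagged.append((prev[0], curr[0]))
    return flagged


# Made-up records for one hypothetical collector.
records = [
    ("A1", date(1950, 6, 1), 17.07, -96.72),
    ("A2", date(1950, 6, 2), 17.20, -96.60),   # a short move: plausible
    ("A3", date(1950, 6, 3), 29.00, -110.90),  # far more than 1000 km in one day
]

print(flag_implausible(records))
```

Run collector by collector over an integrated data set, the flagged pairs form the lists that would be sent to curators for checking against the specimen labels.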

7.3 Detecting Errors in Identification or Georeferencing

A further refinement to specimen data also becomes 
possible, which will detect problems either in species 
identification or in georeferencing of localities. In 
essence, by viewing large quantities of occurrence 
data for a particular species, it is possible to detect 
spatial outliers, which likely represent identification 
or georeferencing problems. This process can be 
refined still further via ecological niche modeling for 
species; the ecological needs of a species are modeled 
(Peterson 2001; Peterson et al., in press) using 
high-end computational tools (Stockwell & Noble 
1992; Stockwell 1999; Stockwell & Peters 1999). 
These procedures use known occurrences of a species 
to produce a geographic view of areas meeting and 
not meeting its ecological needs; overlaying the same 
known occurrence points used to build the models 
allows identification of outlier occurrences. 
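
The simpler, purely spatial first step (before any niche modeling) can be sketched as follows. This is a deliberately basic stand-in and not the ecological niche modeling approach cited above: it flags records lying much farther from the median centre of a species' occurrences than is typical. The factor of 5, the record layout, and the sample points are assumptions for illustration.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import median


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def spatial_outliers(points, factor=5.0):
    """Flag records unusually far from the median centre of all occurrences.

    points: list of (record_id, latitude, longitude) for one species.
    A record is flagged when its distance from the median centre exceeds
    `factor` times the median of all such distances.
    """
    centre_lat = median(p[1] for p in points)
    centre_lon = median(p[2] for p in points)
    distances = [
        (p[0], haversine_km(centre_lat, centre_lon, p[1], p[2])) for p in points
    ]
    typical = median(d for _, d in distances)
    return [record_id for record_id, d in distances if d > factor * typical]


# Made-up occurrence points: five clustered records and one distant record.
points = [
    ("r1", 19.00, -99.00),
    ("r2", 19.10, -99.20),
    ("r3", 18.90, -98.90),
    ("r4", 19.20, -99.10),
    ("r5", 19.05, -99.05),
    ("r6", 30.00, -110.00),
]

print(spatial_outliers(points))
```

A flagged record is not necessarily wrong (it may be a genuine range extension), so the output is a list for a curator to examine, just as with the date-and-locality check.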

As an example of this approach, we used the known
occurrences of the brush-finch Atlapetes pileatus to